Abstract

In  this  paper,  we  introduce  a  novel  set  of  features  for  robust  object  recognition,  which  exhibits  outstanding 
performances  on  a  variety  of  object  categories  while  being  capable  of  learning  from  only  a  few  training 
examples.  Each  element  of  this  set  is  a  complex  feature  obtained  by  combining  position-  and  scale-tolerant 
edge-detectors  over  neighboring  positions  and  multiple  orientations. 

Our  system  -  motivated  by  a  quantitative  model  of  visual  cortex  -  outperforms  state-of-the-art  systems  on 
a  variety  of  object  image  datasets  from  different  groups.  We  also  show  that  our  system  is  able  to  learn  from 
very  few  examples  with  no  prior  category  knowledge.  The  success  of  the  approach  is  also  a  suggestive 
plausibility  proof  for  a  class  of  feed-forward  models  of  object  recognition  in  cortex.  Finally,  we  conjecture 
the  existence  of  a  universal  overcomplete  dictionary  of  features  that  could  handle  the  recognition  of  all 
object  categories. 
1 Introduction

Most state-of-the-art object recognition systems appear
to be hand-crafted and heuristically optimized for one
object category, e.g., faces [1-3], cars [2] or pedestrians [4].
Typically these systems require a large number
of segmented training examples. This is in sharp contrast
with primates' ability to learn to categorize objects
from only very few unsegmented examples.

Recently, systems have been presented that can learn
to recognize many objects (one at a time) using an unsegmented
training set [5, 6]. These methods recognize
highly informative object components and their
spatial relations, called constellations. Not only did
those methods achieve good performance, but they were also
shown to work with a very small training set containing
few positive examples [6]. Yet another striking difference
between these recent systems and the state-of-the-art
single-object-recognition systems (e.g., MERL's
AdaBoost-based face and pedestrian detection systems,
or MobileEye's SVM-based car detection system) is that
they use generative (Bayesian) algorithms.

In this work we present a system that is simpler than
constellation models [5, 6]: it uses discriminative methods
and does not make use of any local object geometry.
Yet it is able to learn from very few examples and to perform
significantly better than all other systems we have
tested. Our system first computes a set of biologically-inspired
C2 features learned from the positive training
set. We then run a standard classifier on the vector of
features obtained from the input image. We report results
using both linear SVM and gentleBoost. Since the
source code is readily available, our results should be
easy to reproduce and extend.

Other existing features. Hierarchical approaches to
generic object recognition have become increasingly
popular over the years. They have been shown to outperform
non-hierarchical single template (holistic) object
recognition approaches on a variety of object recognition
tasks (e.g., face-detection [7]). Recognition is usually
done in two steps: target features (also called components
[4, 7], parts [8] or fragments [9]) are first computed
and then passed to a combination classifier for
final analysis. Constellation models using generative
methods have been proposed in [5, 6, 8]. A robust face-detection
system was built with a two-layer SVM system
in [7] and variants of boosting algorithms were presented
for fast face-detection [3] and multi-class [10] object
recognition approaches.

One limitation of those template-matching-based features
is that they do not adequately capture variations
in the object appearance: they are very selective for a
target shape but lack invariance with respect to object
transformations. At the other extreme, histogram-based
descriptors [11, 12] have been shown to be very robust
with respect to object transformations. The SIFT features
[11], for instance, have been shown to excel in the
re-detection of a previously seen object under new image
transformations.

However, as we confirmed experimentally (see section 4),
with such a degree of invariance, it is very unlikely
that those features could perform well on a
generic object recognition task. We here propose a new
set of features that exhibit just the right trade-off between
invariance and selectivity. They are much more
flexible than components [4] or fragments [9] and more
selective than local descriptors. Though they are not
strictly invariant to rotation, invariance to rotation can
be introduced via the training set (e.g., by introducing
rotated versions of the original input).

Biological visual systems as guides. Because humans
and primates outperform the best machine vision
systems by almost any measure, building a system that emulates
object recognition in cortex has always been an
attractive idea. However, for the most part, the use of
visual neuroscience in computer vision has been limited
to a justification of Gabor filters. No real attention has
been given to biologically plausible features of higher
complexity. While mainstream computer vision has always
been inspired and challenged by human vision, it
seems to never have advanced past the very first stage
of processing in the simple cells of V1. Models of biological
vision [13-16] have not been extended to deal
with real-world object recognition tasks and tested on
them.

The standard model of visual cortex. Our system follows
the standard model of object recognition in primate
cortex [17]. The model itself attempts to summarize in a
quantitative way what most visual neuroscientists generally
agree on: the first few hundred milliseconds of
visual processing in primate cortex follow a mostly
feed-forward hierarchy. At each stage, the receptive
field of the neuron (i.e., the part of the visual field that
could potentially elicit a neuron's response) tends to get
larger along with the complexity of its preferred stimuli
(i.e., the set of stimuli that are likely to elicit a neuron's
response).

In its simplest form, the standard model consists of
four layers of computational units where simple S units
alternate with complex C units. The S units combine
their inputs with Gaussian-like tuning to increase object
selectivity. The C units pool their inputs through a
maximum operation, thereby introducing invariance to
scale and translation. The standard model has been able
to duplicate quantitatively the generalization properties
exhibited by neurons in inferotemporal monkey cortex
(the so-called view-tuned units) that remain highly
selective for particular objects (a face, a hand, a toilet
brush) while being invariant to a range of scales and
positions.
The standard model in its simplest version [15] used
a very simple static dictionary of features. It was suggested
that features from the third and higher layer in
the model should instead be learned from visual experience.
We have extended the standard model by showing
how to learn a vocabulary of visual features from
images and applying it to the recognition of real-world
object categories.

2  The  C2  features 

It is important to stress that biology imposes strong constraints
on our system architecture: consistent with the
standard view in neuroscience, our architecture is feedforward
and does not involve image scanning over all
positions and sizes, the standard approach in computer
vision. It also limits the basic operations that can be performed
by individual units.

Our system is summarized in Fig. 1: the first two layers
correspond to primate primary visual cortex, V1,
i.e., the first visual cortical stage, which contains simple
(S1) and complex (C1) cells [18]. The S1 responses
are obtained by applying to the input image a battery of
Gabor filters, which can be described by the following
equation:


$$G(x, y) = \exp\left(-\frac{X^2 + \gamma^2 Y^2}{2\sigma^2}\right) \times \cos\left(\frac{2\pi}{\lambda} X\right),$$

where $X = x\cos\theta + y\sin\theta$ and $Y = -x\sin\theta + y\cos\theta$.

We adjusted the four filter parameters, i.e., orientation
$\theta$, aspect ratio $\gamma$, effective width $\sigma$, and wavelength
$\lambda$, so that the tuning profiles of S1 units match those of V1
parafoveal simple cells. This was done by first sampling
the space of the parameters and then generating a large
number of filters. We applied those filters to stimuli
commonly used to assess V1 neurons' tuning properties
[18] (i.e., gratings, bars and edges). After removing
filters that were incompatible with biological cells [18],
we were left with a final set of 16 filters at 4 orientations
(see table 1).
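The Gabor equation can be made concrete with a short sketch. The pure-Python function below is an illustration only, not the authors' implementation: the name `gabor_kernel`, the centering of the sampling grid on the middle pixel, and the aspect ratio of 0.3 in the usage line are assumptions. The filter size, effective width and wavelength in the usage line are the band 1 values from Table 1.

```python
import math

def gabor_kernel(size, theta, gamma, sigma, lam):
    """Return a size x size Gabor kernel as a list of rows.

    G(x, y) = exp(-(X^2 + gamma^2 Y^2) / (2 sigma^2)) * cos(2 pi X / lam)
    with X = x cos(theta) + y sin(theta), Y = -x sin(theta) + y cos(theta).
    """
    half = size // 2  # assumes an odd size so the grid is centered on a pixel
    kernel = []
    for row in range(size):
        y = row - half
        line = []
        for col in range(size):
            x = col - half
            X = x * math.cos(theta) + y * math.sin(theta)
            Y = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(X * X + gamma * gamma * Y * Y) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * X / lam)
            line.append(envelope * carrier)
        kernel.append(line)
    return kernel

# Band 1, smaller scale: 7 x 7 filter, sigma = 2.8, lambda = 3.5 (Table 1).
# gamma = 0.3 is an illustrative assumption, not a value given in the text.
k = gabor_kernel(7, 0.0, 0.3, 2.8, 3.5)
```

Any normalization of the kernel (for example zero mean or unit norm) is left out because the text does not specify one.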

The next stage - C1 - corresponds to complex cells
which show some tolerance to shift and size: complex
cells tend to have larger receptive fields (twice as large
as simple cells), respond to oriented bars or edges anywhere
within their receptive field [18] (shift invariance)
and tend to be more broadly tuned than simple cells [18]
(scale invariance). Modifying the original Hubel &
Wiesel proposal for building complex cells from simple
cells through pooling, Riesenhuber & Poggio proposed
a max-like pooling operation for building position
and scale tolerant C1 units. In the meantime, experimental
evidence in favor of the max operation has
appeared [19, 20]. Again, parameters governing this
pooling operation were set so that C1 units match complex
cells' tuning properties as measured experimentally
(see table 1).


Given an input image, perform the following steps:

S1: Apply a battery of Gabor filters to the input image.
The filters come in 4 orientations $\theta$ and 16 scales
$s$ (see table 1). Obtain 16 x 4 = 64 maps $(S1)^s_\theta$ that
are arranged in 8 bands (e.g., band 1 contains filter
outputs of size 7 and 9, in all four orientations).

C1: For each band, take the max over scales and
positions: each band member is sub-sampled by taking
the max over a grid with cells of size $N^\Sigma$ first and
the max between the two members second, e.g., for
band 1, a spatial max is taken over an 8 x 8 grid first
and then across the two scales (size 7 and 9).

Note: We do not take a max over different orientations;
hence, each band $(C1)^\Sigma$ contains 4 orientation
maps.

During training only: Extract K patches
$P_{i=1,\ldots,K}$ of various sizes $n_i \times n_i$ and all four
orientations (thus containing $n_i \times n_i \times 4$ elements)
from the $(C1)^\Sigma$ maps from all training images.

S2: For image patches $X$ at all positions from each C1
image $(C1)^\Sigma$, compute $Y = \exp(-\gamma \|X - P_i\|^2)$ for each
band and each $P_i$ independently.

Obtain the S2 maps $(S2)^\Sigma_i$.

C2: Compute the max over all positions and scales
for each patch $P_i$ and obtain shift- and scale-invariant
C2 features $(C2)_i$, for $i = 1 \ldots K$.


Figure  1:  Computing  the  C2  features. 


[Figure 2 legend: S1 afferent; strongest S1 afferent]


Figure 2: How scale and position tolerance is gained at the C1
level: Each C1 unit receives inputs from S1 units at the same
orientation (e.g., 0°) arranged in bands. For each orientation,
a band $\Sigma$ contains S1 units in two different sizes and various
positions (grid cell of size $N^\Sigma \times N^\Sigma$). From each grid cell
(see left side) we obtain one measurement by taking the maximum
over all positions: this allows the C1 unit to respond to
a horizontal bar anywhere within the grid, thus providing
a translation-tolerant representation. Similarly, taking a max
over the two sizes (see right side) enables the C1 unit to be
more tolerant to changes in scale.
Table 1: Summary of parameters used in our implementation (see also Fig. 1 and accompanying text).

Band $\Sigma$   filter sizes $s$   effective width $\sigma$   wavelength $\lambda$   grid size $N^\Sigma$
1          7 & 9            2.8 & 3.6                3.5 & 4.6            8
2          11 & 13          4.5 & 5.4                5.6 & 6.8            10
3          15 & 17          6.3 & 7.3                7.9 & 9.1            12
4          19 & 21          8.2 & 9.2                10.3 & 11.5          14
5          23 & 25          10.2 & 11.3              12.7 & 14.1          16
6          27 & 29          12.3 & 13.4              15.4 & 16.8          18
7          31 & 33          14.6 & 15.8              18.2 & 19.7          20
8          35 & 37          17.0 & 18.2              21.2 & 22.8          22

orientation $\theta$: 0; $\pi/4$; $\pi/2$; $3\pi/4$
patch sizes $n_i$: 4 x 4; 8 x 8; 12 x 12; 16 x 16 (x 4 orientations)

Fig. 2 illustrates how pooling from S1 to C1 is done.
For instance, consider the first band: $\Sigma = 1$. For each
orientation, it contains two S1 maps: the one obtained
using a filter of size 7, and the one obtained using a
filter of size 9. Note that both of these S1 maps have
the same dimensions. In order to obtain the C1 responses,
these maps are sub-sampled using a grid cell
of size $N^\Sigma \times N^\Sigma = 8 \times 8$. From each grid cell we obtain
one measurement by taking the maximum of all 64
elements. As a last stage we take a max over the two
scales, by considering for each cell the maximum value
from the two maps. This process is done for each of the
four orientations and each scale band independently.
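The two-step pooling just described (spatial max within each grid cell, then max across the two scales of a band) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `c1_pool` is hypothetical, grid cells are taken to be non-overlapping, and cells at the border are simply truncated.

```python
def c1_pool(s1_a, s1_b, grid):
    """Pool two same-sized S1 maps (one band, one orientation) into a C1 map.

    First take the max inside each grid x grid cell of each map,
    then take the max across the two scales, cell by cell.
    """
    def spatial_max(s1_map):
        rows, cols = len(s1_map), len(s1_map[0])
        pooled = []
        for r0 in range(0, rows, grid):
            pooled_row = []
            for c0 in range(0, cols, grid):
                # Assumption: non-overlapping cells, truncated at the border.
                cell = [s1_map[r][c]
                        for r in range(r0, min(r0 + grid, rows))
                        for c in range(c0, min(c0 + grid, cols))]
                pooled_row.append(max(cell))
            pooled.append(pooled_row)
        return pooled

    pooled_a = spatial_max(s1_a)
    pooled_b = spatial_max(s1_b)
    return [[max(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(pooled_a, pooled_b)]

# Toy example with a 2 x 2 grid instead of the 8 x 8 grid of band 1.
s1_small = [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 5, 1], [0, 0, 1, 1]]
s1_large = [[0, 0, 9, 0], [0, 0, 0, 0], [0, 0, 0, 0], [7, 0, 0, 0]]
c1 = c1_pool(s1_small, s1_large, 2)  # [[4, 9], [7, 5]]
```

In the full system this function would be called once per orientation and per band, with `grid` set to the band's grid size from Table 1.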

In our new version of the standard model the subsequent
S2 stage is where learning occurs. A large pool of patches
of various sizes and at random positions is extracted
from a target set of images at the level of the C1 layer
for all orientations, i.e., a patch P of size n x n contains
n x n x 4 elements. The training process ends by setting
each of those patches as prototypes or centers of the
S2 units (at each position and scale), which behave as radial
basis function (RBF) units during recognition. This
is consistent with well-known response properties of neurons
in primate inferotemporal cortex [21] and seems
to be the key property for learning to generalize in the
visual and motor systems ??. Each S2 unit response depends
in a Gaussian-like way on the Euclidean distance
between a new input and the stored prototype.
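The Gaussian-like dependence of an S2 unit on the Euclidean distance to its stored prototype can be written in a few lines. This is a sketch under assumptions: the function name `s2_response` is hypothetical, patches are represented as flat lists, and the default sharpness `gamma=1.0` is a placeholder because the text does not give the tuning width.

```python
import math

def s2_response(x_patch, prototype, gamma=1.0):
    """Gaussian-like (RBF) response: exp(-gamma * ||X - P||^2).

    x_patch and prototype are flat lists of equal length
    (n * n * 4 values for an n x n patch over 4 orientations).
    """
    if len(x_patch) != len(prototype):
        raise ValueError("patch and prototype must have the same length")
    sq_dist = sum((x - p) ** 2 for x, p in zip(x_patch, prototype))
    return math.exp(-gamma * sq_dist)

# A patch identical to the prototype gives the maximal response of 1.0;
# the response decays as the patch moves away from the prototype.
same = s2_response([0.2, 0.5, 0.1], [0.2, 0.5, 0.1])
far = s2_response([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

Evaluating this function at every position of a C1 band, for one prototype, yields one S2 map.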

An important question for both neuroscience and
computer vision regards the choice of the unlabeled target
set from which to learn - in an unsupervised way -
this vocabulary of visual features. In the remainder of
this paper, features are learned from the positive training
set for each object, but the reader can refer to section 6
for a discussion on how features can be learned
from natural images.

Our final set of shift and scale invariant C2 responses
is computed by taking a global max over all scales and
positions for each S2 type at each position on the S2 lattice.
This results in as many C2 features as patches we
extracted during the learning stage. Finally, in the computer
system described here, the C2 responses to a new
input image are passed to a classifier for final analysis*.
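The global max that turns S2 maps into C2 features can be sketched as below. The data layout is an assumption made for illustration (one list of 2-D S2 maps per prototype, one map per scale band), and the function name `c2_features` is hypothetical.

```python
def c2_features(s2_maps_per_prototype):
    """Return one C2 value per prototype.

    s2_maps_per_prototype[i] is a list of 2-D S2 maps (one per scale band)
    for prototype i; the C2 feature is the max over all bands and positions.
    """
    features = []
    for band_maps in s2_maps_per_prototype:
        best = max(value
                   for s2_map in band_maps
                   for row in s2_map
                   for value in row)
        features.append(best)
    return features

# Two prototypes: the first has S2 maps in two bands, the second in one band.
c2 = c2_features([
    [[[0.1, 0.2]], [[0.7]]],
    [[[0.3], [0.05]]],
])  # [0.7, 0.3]
```

The resulting vector has one entry per extracted patch, which is the input handed to the classifier.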


3  Experimental  Setup 

To demonstrate the quality of the C2 features, we
compared their performance - when used as inputs
to a classifier - with other systems on a variety
of databases. Datasets that were available online
at www.vision.caltech.edu include five (Caltech)
databases (i.e., frontal-face, motorcycle, rear-car and airplane
datasets from [5] and a leaf dataset from [8]) and
a 101-object database from [6]. We also considered two
more challenging datasets: a near-frontal (±30°) face
dataset from [7] provided by Heisele et al. and a new
multi-view car dataset that we collected. Fig. 4 shows
some sample image patterns taken from the car and the
face dataset.

For the Caltech datasets, positive training and test
sets were generated using the splits provided by Fergus
et al. The negative training and test sets were randomly
generated from the same background images as in [5].
All results we report for the 101-object category were
generated with 10 random splits each using 50 training
and 50 test negative examples from the same background
image category as in [6]. For testing, we also
used 50 positive test examples and experimented with
different training set sizes (1, 3, 15, 30). All splits for
the near-frontal face database were identical to the ones
used in [7].

The face dataset contains about 6,900 positive and
13,700 negative images for training and 427 positive and
5,000 negative images for testing. The car dataset contains
4,000 positive and 1,600 negative training examples
and 1,700 test examples (both positive and negative).
Although benchmark algorithms were trained on
the full sets and the results reported accordingly, our
system only used a subset of the training sets (500 examples
of each class only).

*While it would be straightforward to match our final classifier with
prefrontal (PFC) cortex and C2 units with anterior inferotemporal
(AIT) cortex [15, 22], it is more difficult to commit to a brain area
for S2 units. Considering their size and complexity, they could be
located in V4 and/or posterior inferotemporal (PIT) cortex. This
reflects the lack of a precise characterization for neurons in intermediate
brain areas, i.e., between primary visual cortex (S1 and C1
layers) and AIT (C2 layer).
Figure 4: Examples taken from our difficult multi-view car
dataset  and  the  difficult  face  datasets  used  in  [7]. 


Figure 3: Examples of learned features (first 5 features returned
by gentleBoost for each category). Each ellipse characterizes
a C1 afferent at matching orientation, while color encodes
response strength.


It is important to point out that those two datasets
are challenging. The face patterns used for testing are
a subset of the CMU PIE database which contains a
large variety of faces under extreme illumination conditions
(see [7]). The test non-face patterns were selected
by a low-resolution LDA classifier as the most
similar to faces (the LDA classifier was trained on an
independent 19 x 19 low-resolution training set). The
car database was created by taking street scene pictures
in the Boston area. Numerous vehicles (including
SUVs, trucks, buses, etc.) were manually labeled
from those images to form a positive test set. Random
image patterns at various scales that were not labeled
as vehicles were extracted and used as a negative test
set. For benchmarking this dataset, we implemented a
fragment-based gentleBoost algorithm as in [10], as well
as a gray-value single-template linear SVM.

As a preprocessing step to our system, we normalized
images in size: all images from the Caltech web
site were rescaled to be 140 pixels in height (width was
rescaled accordingly so that the image aspect ratio was
preserved) and converted to gray values. Images from
the face database [7] were all 70 x 70 pixels and images
from the car database were scaled down to 120 x 120
pixels.

In the remainder of this paper, to make past and future
comparisons with other systems easier, we report
two accuracy measures for our system: the Receiver
Operating Characteristic area (Area in Fig. 5) that corresponds
to the area under the curve and the error rate
at equilibrium point (Eq pt in Fig. 5), i.e., when the false
positive rate equals the miss rate, since both measures
are reported in the literature equally frequently.
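Both measures can be computed directly from classifier scores. The sketch below is an illustration under assumptions, not the evaluation code used for the reported results: the function names are hypothetical, the ROC area is computed by pairwise comparison of positive and negative scores, and with finitely many samples the equilibrium point is approximated by the threshold at which the miss rate and the false positive rate are closest.

```python
def roc_area(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def equal_error_rate(pos_scores, neg_scores):
    """Error rate at the threshold where false-positive and miss rates are closest."""
    best_gap, best_rate = None, None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        # A sample is classified positive when its score is >= t.
        miss = sum(1 for p in pos_scores if p < t) / len(pos_scores)
        fpr = sum(1 for n in neg_scores if n >= t) / len(neg_scores)
        gap = abs(miss - fpr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (miss + fpr) / 2.0
    return best_rate

pos = [0.9, 0.8, 0.4]   # scores of positive test examples (made-up values)
neg = [0.7, 0.3, 0.2]   # scores of negative test examples (made-up values)
area = roc_area(pos, neg)          # 8 of 9 pairs are ranked correctly
eer = equal_error_rate(pos, neg)   # miss rate = false positive rate = 1/3
```

The pairwise form is quadratic in the number of samples, which is acceptable for test sets of the sizes quoted here; a sort-based computation would be preferable for much larger sets.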


4  Results 

Figure 5 contains a summary of the performance exhibited
by the C2 features used in conjunction with
linear SVM (C2 + (linear) SVM) and gentleBoost [3]
(C2 + gentleBoost) for various datasets, along with
published results from other systems (Benchmark algorithms).
Results obtained with the C2 features are consistently
higher than those previously reported on all
the datasets we tested: the leaf database [8], rear-car,
frontal-face, motorcycle and airplane datasets [5], as
well as two more challenging datasets, that is, the near-frontal
(±30° rotation) face dataset [7] and a newly introduced
multi-view car database.

As Fig. 6 (left) shows, after a critical number of C2 features
(about 100), performance does not depend strongly
on the number of features. For this experiment we first
created a set of 10,000 features in total and randomly selected
subsets of various sizes. The results shown are
the average of 10 independent runs. As is evident, performance
could still be improved when allowing more
features (e.g., the whole set of 10,000), but reasonable
performance can be obtained even with 50 features.

Our system seems to outperform the component-based
system presented in [7] using a hierarchy of
SVMs on the difficult face database. It also seems to
outperform a system similar to [10] based on Ullman's
features [9] and gentleBoost on the difficult car database
(though it was trained using a much smaller training
set). For illustration, we show in Fig. 6 (right) the ROC
curves for both systems (C2-based and fragment-based
with gentleBoost) and a single-template linear SVM.

Figs. 7 and 8 summarize the results we obtained on
the 101-object database. For each object category, we
generated positive training sets of sizes 1, 3, 6, 15 and
30 as in [6] (10 random splits for each). The negative
training sets and the test sets (both positive and
negative) all contained 50 randomly selected examples.
Fig. 7 (left) shows the performance of the C2 features-based
system (with gentleBoost) on the same datasets as
the ones used by Fei-Fei et al. for illustration.
Datasets                Benchmark algorithms     C2 + gentleBoost    C2 + (linear) SVM
                        Ref.   Area    Eq pt     Area     Eq pt      Area     Eq pt
Leaves (Caltech)        [8]    NA      84.0      99.4     97.0       99.5     95.9
Cars (Caltech)          [5]    NA      84.8      100.0    99.7       100.0    99.8
Faces (Caltech)         [5]    NA      96.8      99.8     98.2       99.8     98.1
Airplanes (Caltech)     [5]    NA      94.0      99.6     96.7       98.8     94.9
Motorcycles (Caltech)   [5]    NA      95.0      99.8     98.0       99.7     97.4
Faces                   [7]    96.0    90.4      99.3     95.9       99.2     95.3
Cars                    (*)    83.3    75.4      98.8     95.1       97.7     93.3

Figure 5: Sample results obtained by classifying the C2 features (1,000) with both a linear SVM (C2 + (linear) SVM) and gentleBoost
(C2 + gentleBoost) and comparison with existing systems (Benchmark algorithms). We report both the ROC area (Area) and the error
rate at equilibrium point (Eq pt). (*) corresponds to a system we implemented that uses Ullman's features [9] and gentleBoost as
in [10].


Figure  6:  (left)  C2  features  performances  (with  gentleBoost)  on  various  Caltech  datasets  [5]  for  different  numbers  of  features  and 
(right)  ROC  curves  obtained  with  the  C2  features  on  the  difficult  car  dataset  for  comparison  with  a  component-based  (gentleBoost) 
system  similar  to  [10]  and  a  single-template  SVM. 


Figure  7:  C2  features  performances  with  linear  SVM  (left)  and  gentleBoost  (right)  on  sample  categories  [6]  from  the  101-object 
database  for  different  numbers  of  positive  training  examples. 
Figure 8: Overall performances (histograms) across the 101-object categories for the C2 features with (left) linear SVM and (right)
gentleBoost and different numbers of positive training examples.


[Figure 9 axis label: SIFT-based features performance (equilibrium point)]


Figure  9:  Superiority  of  the  C2  vs.  SIFT-based  features  on  (left)  various  (Caltech)  datasets  for  different  numbers  of  features  and 
(right)  on  the  101-object  database  for  different  numbers  of  positive  training  examples  as  in  [6]. 


Our gentleBoost system achieves error rates similar
to the ones reported in [6], with very few training examples
(from 3 to 6) and tends to do better with more
examples. It seems that SVM avoids overfitting even
for one example - see Fig. 7 (left) - and outperforms
gentleBoost for one-shot learning (learning from one example).
However, since SVM does not seem to be able
to select the relevant features, its performance is lower
than gentleBoost as the number of training examples increases
(see 15 and 30 examples). Fig. 8 shows the performances
of the gentleBoost and SVM classifiers used
with the C2 features over all categories and for various
training set sizes (each result is an average of 10 different
random splits). Each plot is a single histogram of all
101 scores, obtained using a fixed number of training
examples, e.g., with 40 examples, the gentleBoost-based
system gets 95% correct for 42% of the object categories.

We also compared our C2 features to a system based
on Lowe's SIFT features [11]. For this comparison, we
neglected all position information recovered by Lowe's
algorithm. We selected 1000 random reference key-points
from the training set. Given a new image, we
measured the minimum distance between all its key-points
and the 1000 reference key-points, thus obtaining
a feature vector of size 1000. Note that Lowe recommends
using the ratio of the distances between the
nearest and the second closest key-point as a similarity
measure. We found instead that the minimum distance
leads to better performances than the ratio.
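The construction of the SIFT-based feature vector can be sketched as follows. This is an illustration under assumptions: the function name `min_distance_features` is hypothetical, descriptors are plain lists of numbers, and the Euclidean metric is assumed since the text only speaks of a "minimum distance". Extracting the key-point descriptors themselves is outside the scope of the sketch.

```python
import math

def min_distance_features(image_keypoints, reference_keypoints):
    """For each reference descriptor, return the distance to its nearest
    descriptor in the image. Descriptors are equal-length lists of numbers."""
    if not image_keypoints:
        raise ValueError("the image must contain at least one key-point")

    def euclidean(a, b):
        # Assumption: Euclidean distance between descriptors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    features = []
    for ref in reference_keypoints:
        nearest = min(euclidean(ref, kp) for kp in image_keypoints)
        features.append(nearest)
    return features

# Toy 2-D descriptors; real SIFT descriptors are much longer vectors.
refs = [[0.0, 0.0], [3.0, 4.0]]
image = [[0.0, 0.0], [6.0, 8.0]]
vec = min_distance_features(image, refs)  # [0.0, 5.0]
```

With 1000 reference key-points the returned vector has 1000 entries, one per reference, matching the feature vector size stated in the text.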

Figure 10: (Left) Multiclass classification on the 101-object database with linear SVM and (right) object-specific vs. universal features
on the 101-object database.

In Fig. 9 (left) we compared the SIFT-based features
and the C2 features on various Caltech datasets (leaf,
motorcycle, airplane, car and face). The gain in performance
obtained by using the C2 features relative to the
SIFT-based features is obvious. This is true with gentleBoost
- used for classification in Fig. 9 (left) - but we
also found very similar results with a linear SVM. Also,
as one can see in Fig. 9 (right), the performance of the C2
features (error at equilibrium point) for each category
from the 101-object database is well above that of the
SIFT-based features for any number of training examples.
This difference was significant (paired t-test over
all  training  sizes,  p  =  1CV78). 

Fig. 3 shows examples of features we obtained after
training for motorcycle, face, airplane, starfish and yin-yang
images. This figure shows the first five features
selected by the gentleBoost algorithm. Recall that each
feature contains n x n x 4 elements, where n is the number
of S1 afferents or patch size and 4 corresponds to the
4 orientations. For visualization, we collapsed all orientations
onto a single map, i.e., each ellipse characterizes
an S1 afferent at matching orientation, while color
encodes its response strength. One should keep in
mind that this simplified representation is inaccurate:
C1 units are translation and scale tolerant, i.e., their preferred
stimulus is not unique. For simplicity, we represent
an ellipse in the center of each unit but in practice its
exact location may vary. As one can see from Fig. 3, features
that were chosen by the boosting algorithm also
vary widely from one category to another (both in size
and shape).

Finally, we conducted initial experiments on the multiple
class case. For this task we used the 101-object
dataset. We split each category into a training set of
size 15 or 30 and a test set containing the rest of the
images. We used a multiple-class linear SVM to train
a classifier. The SVM applied the all-pairs method for
multiple labels classification, and was trained on 102 labels
(101 categories plus the background category). The
number of C2 features used in these experiments was
4075. The results we obtained, averaged over 10 repetitions
of the experiments, were 35% correct classification
rate when using 15 training examples per class,
and 42% correct classification rate when using 30 training
examples. Fig. 10 shows the confusion matrix for
the 101-object categories.


5 Implications for Object Recognition in Cortex

Our approach is biologically motivated and our system
belongs to a family of feed-forward models of object
recognition in cortex that have been shown to be able
to duplicate neurons' tuning properties in several visual
cortical areas. In particular, Riesenhuber & Poggio
showed that such a class of models accounts quantitatively
for the tuning properties of view-tuned units
in inferotemporal cortex (IT) which respond to images
of the learned object more strongly than to distractor
objects, despite significant changes in position and
size [22]. Riesenhuber & Poggio reported performance
of the model only on idealized stimuli such as paperclips
on a uniform background [23] (no real-world image
degradation such as change in illumination, clutter,
etc.). The success of our extension of their original
model on a variety of real-world object databases
provides a compelling plausibility proof for this class
of feed-forward models.

A long-standing goal of computer vision has been to build a system
that achieves human-level recognition performance. Until now,
biology had not suggested a good solution. In fact, the superiority
of human performance over the best artificial recognition systems
has long lacked a satisfactory explanation. Computer vision
approaches had also diverged from biology: for instance, some of
the best existing computer vision systems use geometrical
information about objects' constitutive parts, whereas biology is
unlikely to be able to use it, at least in the cortical stream
dedicated to shape processing and object recognition. The system
described in this paper may be the first counter-example to this
situation: it is based on a model of object recognition in cortex
[15], it respects the properties of cortical processing (including
the absence of geometrical information) while showing performance
at least comparable to the best computer vision systems.

    10 best                     10 worst
    Category       Eq. pt.      Category      Eq. pt.
    metronome      100.0        chair         73.0
    inline skate    99.5        barrel        72.1
    scissors        98.3        ibis          72.1
    pagoda          98.1        octopus       71.6
    trilobite       97.9        cup           71.3
    faces           97.3        cannon        71.1
    accordion       97.2        wheelchair    70.8
    minaret         96.2        lamp          70.6
    faces easy      95.7        flamingo      68.4
    car side        95.7        ewer          62.9

Table 2: 10 best and 10 worst categories (Eq. pt.) from the
101-object database.

We finally show results suggesting that it is possible to perform
robust object recognition from C2 features learned from natural
images. In Fig. 10, we compare the performance of two sets of
features on the 101-object database: (1) a set of object-specific
features that were learned from the training set of the target
object category (20 features per training image); and (2) a
universal set of 10,000 features that were learned from a general
set of natural images (downloaded from the web). While the
object-specific set performs significantly better with enough
training examples (p = 3.7 × 10^-7, paired t-test for 30 training
examples), the situation is reversed for small training sets
(p = 7.5 × 10^-11, paired t-test for 1 training example).

This apparent superiority of the universal set over the
object-specific one for small training sets is somewhat
counter-intuitive and very interesting. Two factors may contribute
to it. First, the universal feature set is less prone to
overfitting with few training examples (recall that both feature
learning and classifier training are performed on the same set in
the object-specific case). Second, the size of the universal set is
constant (10,000) regardless of the number of training examples,
while the size of the object-specific set is much smaller (20 times
the number of training images).

We believe that this represents a relevant and intriguing result on
its own, a step towards the holy grail of finding the elusive
universal dictionary of visual shapes. Our results also suggest
that it should be possible for biological organisms to acquire a
basic vocabulary of features early in development while refining it
with more specific features during adulthood. The latter point is
consistent with reports of plasticity in inferotemporal cortex of
adult monkeys [22, 24] (the complexity and sizes of our C2 features
are consistent with neurons' receptive fields in posterior IT
[24]).


6  Discussion

In the present paper we described a new framework for robust object
recognition: our system first computes a set of
biologically-inspired scale- and translation-invariant C2 features
from a training set of images. We then run a standard classifier on
the vector of features obtained from the input image. We showed
that our approach exhibits outstanding performance on a variety of
image datasets.
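
The first stage of this framework can be sketched in a few lines. The
code below is a simplified illustration, assuming that each feature is
the maximum, over all image positions, of a Gaussian-like similarity
between a stored patch and the local image window; taking the maximum
is what makes the value tolerant to translation. Several points are
assumptions made for brevity and are labeled as such in the code: the
patches are compared to raw pixel values rather than to edge-detector
outputs, only one scale and one orientation are used (tolerance to
scale would add a further maximum over scales), and the names
`c2_feature`, `c2_vector`, `embed` and the parameter `gamma` are
hypothetical.

```python
import math


def c2_feature(image, patch, gamma=1.0):
    """Max over positions of exp(-gamma * squared distance to `patch`).

    `image` and `patch` are lists of rows of floats. Assumption: a single
    scale and raw pixel values are used here for simplicity.
    """
    patch_h, patch_w = len(patch), len(patch[0])
    image_h, image_w = len(image), len(image[0])
    best = 0.0
    for r in range(image_h - patch_h + 1):
        for c in range(image_w - patch_w + 1):
            dist2 = sum(
                (image[r + i][c + j] - patch[i][j]) ** 2
                for i in range(patch_h)
                for j in range(patch_w)
            )
            best = max(best, math.exp(-gamma * dist2))
    return best


def c2_vector(image, patches, gamma=1.0):
    """One feature per stored patch; this vector is what a classifier sees."""
    return [c2_feature(image, patch, gamma) for patch in patches]


def embed(patch, size, row, col):
    """Toy helper: a size-by-size zero image with `patch` pasted at (row, col)."""
    image = [[0.0] * size for _ in range(size)]
    for i, patch_row in enumerate(patch):
        for j, value in enumerate(patch_row):
            image[row + i][col + j] = value
    return image


# Hypothetical toy patches.
diagonal = [[1.0, 0.0], [0.0, 1.0]]
block = [[1.0, 1.0], [1.0, 1.0]]
```

In this sketch, two images that contain the same pattern at different
positions, such as `embed(diagonal, 5, 0, 0)` and
`embed(diagonal, 5, 3, 2)`, produce identical feature vectors, which
can then be passed to any standard classifier.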

A biologically-inspired state-of-the-art approach. While
significantly simpler than other state-of-the-art systems, our
approach nonetheless exhibits consistently better results than all
systems we have compared it to. For instance, the systems described
in [5, 6, 8] involve the estimation of probability distributions;
[7] uses a hierarchy of SVM classifiers and requires
correspondences between positive training images (3D head models).

We also showed that a relatively small number of features (about
50) is sufficient to achieve reliable object recognition. However,
performance can be increased significantly by adding more features.
Interestingly, the number of features needed to reach the plateau
(about 5,000 features) is much larger than the number used by
current systems (on the order of 10-100 for [7, 9, 10] and 4-8 for
constellation approaches [5, 6, 8]).

On the role of relative geometry for generic object recognition. It
is also important to point out that, contrary to recent trends (but
consistent with neurophysiological constraints), we do not model
local object geometry. The constellation approaches [5, 6, 8] rely
on a probabilistic shape model; in [7] the position of the facial
components is passed to a combination classifier (along with their
associated detection values); in [10] object parts are searched for
only in their approximate expected position. We should emphasize
that the absence of shape information in our approach follows
directly from the standard model; it was therefore guided by what
we know about the properties of visual processing within the
sequence of visual areas comprising the ventral stream, which is
responsible for object recognition.

Use of prior vs. use of negative examples. In recent years,
generative models have gained popularity in object recognition
applications. In particular, it was recently shown that generative
models combined with the use of prior category information could
produce systems able to learn from few examples [6]. Our system
does not exploit any prior, but instead uses a training set which
contains negative examples. Negative examples provide extremely
useful information to our classifier at little cost (negative
examples are easy to obtain). Note that in the tests reported here,
we did not tune any parameter to obtain optimal performance.
Instead, we set the parameters to match what is known about the
primate visual system.

The quest for universal features. We finally showed preliminary
results suggesting that it is possible to perform robust object
recognition with a universal set of C2 features learned from
natural images (see section 5). We plan on making this universal
feature set available to the community on our web site soon. As
those features were learned from randomly selected images, they
might not all be useful for classification; we are now studying
which features, out of this large set, are indeed informative.